ABSTRACT



The recent development of high-quality voice recognition 
software greatly facilitates the production of transcriptions for research 
and allows for objective and full transcription as well as annotated 
interpretation. Commercial speech recognition programs that are appropriate 
for generating transcriptions are available from a number of vendors, with 
varying degrees of difficulty in use. There are two fundamental approaches to 
using speech recognition to produce transcriptions: (1) real-time; and (2) 
batch. The real-time approach uses speech recognition while the interview is 
in progress; the batch approach relies on an audio recording to make it 
possible to process several interviews in a batch. Transcription by speech 
recognition still requires a human transcriptionist, and the best 
transcription time that can be achieved is about double the length of the 
interview. However, when the user is not a skilled typist, considerable 
savings of time can be achieved. If a researcher finds voice recognition 
software to be superior to conventional typing approaches for transcribing 
interviews, he or she is likely to find it useful for other tasks as well. 
(Contains 21 references.) (SLD) 

The Role of Transcriptions In Qualitative Research 

Many approaches to qualitative research employ interview techniques. Morse (1994) named participant 
observation, phenomenology, ethnography, grounded theory, and ethnoscience as examples. Speaking at 
an early stage in the rise of qualitative methodology for educational research, Erickson (1986) stated that 
fully half of the literature produced concerning qualitative research dealt with data interpretation. 
However, opinions remain divided as to whether interview notes need to be transcribed as a part of the 
interpretation process (see Poland, 1995; Huberman & Miles, 1994; and Patton, 1990 for supporting 
arguments; Mischler, 1991 for opposing arguments; and Seidman, 1998 for alternative suggestions). 

Although transcription remains a necessary fundamental element for methodologies involving content 
analysis (see Richards & Richards, 1994), we note that the majority of recent writers concerned with 
qualitative interpretation have moved away from generating full transcriptions. Without stepping into the 
argument of the "myth of objective transcription" (Green, Franquiz, & Dixon, 1997; see also Lapadat & 
Lindsay, 1998; Psathas & Anderson, 1990; and Denzin, 1989), this paper argues that the recent 
development of high-quality voice recognition software greatly facilitates the production of transcriptions 
and allows for both objective, full transcription and annotated interpretation. 

Authors discussing the interpretation of transcripts describe the generation process as "labor intensive" 
(Patton, 1990; see also Seidman, 1998; Kvale, 1996; Huberman & Miles, 1994; and Miles & Huberman, 
1994). Estimates of the time needed to generate transcripts range from a factor of 2.5 to 4 times the length 
of the recording being transcribed (Seidman, 1998 and Patton, 1990) up to 4 to 8 times, depending on the 
fineness of detail and familiarity with content (Miles & Huberman, 1994). After acknowledging the 
difficulty of generating transcriptions, these authors are then silent as to how the process should occur 
(although Seidman, 1998 suggests hiring another person). This paper postulates that one of the 
unacknowledged reasons for the shift away from transcription generation is the large investment in time 
and energy that it represents. 
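
The scale of that investment can be made concrete with a little arithmetic. The sketch below applies the multipliers cited above to a hypothetical study; the figure of twenty one-hour interviews and the function name are illustrative assumptions, not data from any of the cited works.

```python
# Multipliers quoted in the text: hours of transcription per hour of recording.
FACTORS = {
    "Seidman (1998); Patton (1990)": (2.5, 4.0),
    "Miles & Huberman (1994)": (4.0, 8.0),
}


def transcription_hours(recording_hours, factor_range):
    """Return the (low, high) hours needed to transcribe a recording."""
    low, high = factor_range
    return recording_hours * low, recording_hours * high


if __name__ == "__main__":
    # Hypothetical study size, chosen only for illustration.
    total_recording_hours = 20 * 1.0
    for source, factor_range in FACTORS.items():
        low, high = transcription_hours(total_recording_hours, factor_range)
        print(f"{source}: {low:.0f} to {high:.0f} hours")
```

Under these assumptions, even the most optimistic estimate amounts to more than a full working week of typing.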

Nevertheless, electronic transcriptions remain the most reusable resource for qualitative research: 
transcribed interviews can be searched, re-interpreted, collected and shared with other researchers in a more 
meaningful way and with much greater efficiency than any other medium. Moreover, the availability of 
transcriptions can be seen to more readily allow audit trails (Janesick, 1994; Lincoln & Guba, 1985), 
member checks (Lincoln & Guba, 1985), and support of alternative explanations (Patton, 1990), all 
techniques that are seen as useful in establishing interpretive validity (Altheide & Johnson, 1994; Atkinson, 
1992). 

Recently, the possibilities presented by speech recognition software have been considered for generating 
interview transcripts (Fogg & Dzuik, 1999; Levitt, 1998), promising more efficient transcript generation 
and savings in time and effort. Previously, most discussions of computer uses for 
qualitative interpretation have focused on storing and sorting evidence (Richards & Richards, 1995; 
Huberman & Miles, 1994; Tesch, 1990). In this paper we outline the logistical difficulties of producing 
interview transcripts and explore several ways in which speech recognition software can address them. In 
addition, we briefly examine several other ways in which such software can facilitate qualitative research and 
identify areas in which speech recognition technology needs to be improved. 



Characteristics and Capabilities of Speech Recognition 

Commercial speech recognition programs that are appropriate for generating transcriptions are available 
from a number of vendors such as Lernout and Hauspie, IBM, and Dragon Systems. These programs are 
designed to recognize continuous speech (normal, fluent speech without requiring pauses between words) 
with large vocabularies (tens of thousands of words). The performance of these programs is highly variable 
and critically dependent upon several variables: (1) the capabilities of the computer on which they are 
installed, (2) the microphone quality, (3) the background noise environment, and (4) how closely 
the models used by the software reflect the speech being recognized. When these variables are properly 
controlled, recognition accuracies of better than 95% can be obtained. 
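
The text does not say how recognition accuracy is measured. One common convention, assumed here, is to count word-level substitutions, insertions, and deletions against a reference transcript (the word error rate) and to report accuracy as one minus that rate. The following is a minimal sketch of that calculation; the function name and sample sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "the quality of the microphone is extremely important"
    recognized = "the quality of a microphone is extremely important"
    wer = word_error_rate(spoken, recognized)
    print(f"word error rate: {wer:.3f}, accuracy: {1 - wer:.1%}")
```

On this measure, 95% accuracy still leaves roughly one word in twenty to be corrected by hand.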



Computer Hardware Requirements 

Each speech recognition vendor specifies minimal requirements for the computer to be used with their 
product. It must be recognized that these are typically the absolute minimums and that there are good 
reasons to significantly exceed these requirements. Because speech recognition requires searching through 
very large data structures, the speed of the recognition will be directly affected not only by the processor 
speed but also by the amount of memory that is available. Typically, doubling the recommended amount of 
memory will result in very noticeably faster responses. A leading voice recognition (VR) vendor, for example, in 1999 
recommended a 200 MHz processor speed with 64 MB of memory; simultaneously, a leading commercial 
trainer for VR software was finding a 333 MHz processor with 96 MB RAM necessary to achieve desirable 
performance with this same software (Fogg & Dzuik, 1999). 

The other area in which "more is better" is disk space. Digitized speech can consume upwards of a 
megabyte per minute. That is, a couple of hours of recordings can easily occupy several hundred megabytes 
of disk space. Compression techniques commonly used to greatly reduce the size of audio files sent over the 
web (such as MP3) create distortions that, even when inaudible to human ears, can increase the recognition 
error rate by an order of magnitude. 
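
The storage figures follow directly from the recording format. The sketch below computes the size of uncompressed audio; the format used (16 kHz sampling, 16-bit samples, one channel) is an assumption chosen for illustration, since the text does not specify one, and file header overhead is ignored.

```python
def pcm_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size in bytes of uncompressed PCM audio (header overhead ignored)."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * seconds


if __name__ == "__main__":
    # Assumed format, not taken from the text: 16 kHz, 16-bit, mono.
    per_minute = pcm_bytes(16_000, 16, 1, 60)
    two_hours = pcm_bytes(16_000, 16, 1, 2 * 60 * 60)
    print(f"per minute: {per_minute / 1_000_000:.2f} MB")
    print(f"two hours:  {two_hours / 1_000_000:.1f} MB")
```

With these settings a minute of speech occupies just under two megabytes and two hours occupy about 230 megabytes; higher sampling rates or stereo recording scale the totals proportionally.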



Microphone Quality and Background Noise Environment 

The quality of the microphone is extremely important. Because any distortion in the speech signal can 
introduce recognition errors, it is essential that the microphone be of high quality. In some cases even the 
headset microphone provided by the vendor to accompany the software is seen to be inadequate for optimum 
use (Fogg & Dzuik, 1999). Moreover, speech recognition software has only the most elementary capability 
to separate speech from background noises, so it is essential that the microphone capture as little 
background noise as possible. The best recognition results are obtained when using a headset-mounted, 
noise-canceling microphone in a quiet room. Recordings made with a cheap microphone placed on a table 
between two speakers (as is typical in an interview situation) are likely to produce recognition error rates in 
excess of 50%. 



Model Training 

Mathematical models drive speech recognition software. The best recognition accuracy results when these 
models are closely tuned to the speaker's voice and word usage patterns. The potential user must recognize 
the necessity for training a model for each speaker's voice and customizing the vocabulary and word usage 
patterns for the topics being discussed. Conventional estimates of the time needed to train software for the 
voice model have centered around 10 hours. Newer versions of recognition software promote an enrollment 
period of only 5 minutes for initial voice recognition by the software, but this low estimate, as might be 
expected, is accompanied by a high error rate. The user is encouraged to have patience through this 
learning process; the system will become more efficient and accurate in recognizing voice patterns after 
about five hours of use. 

Handling Punctuation And Disfluencies 

In the view of the authors, the single greatest impediment to widespread use of speech recognition for 
transcription is the inability of the software to automatically generate punctuation. In all current speech 
recognition products, it is necessary to speak the punctuation characters ("period", "comma", "question 
mark") if they are to appear in the output text. This requires training the system to recognize the 
punctuation commands, and the user must become practiced in speaking punctuation. 
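
The idea of dictated punctuation can be illustrated with a toy substitution that turns the three spoken commands named above into characters. This is only a sketch of the concept, not a description of how any commercial product works internally; the function name and command table are assumptions made for the example.

```python
import re

# Spoken commands named in the text, mapped to the characters they produce.
SPOKEN_PUNCTUATION = {
    "question mark": "?",
    "period": ".",
    "comma": ",",
}


def render_dictation(dictated):
    """Replace spoken punctuation commands with punctuation characters."""
    text = dictated
    for phrase, symbol in SPOKEN_PUNCTUATION.items():
        # Consume the whitespace before the command so that the symbol
        # attaches directly to the preceding word.
        pattern = r"\s*\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, symbol, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(render_dictation(
        "did the recorder work question mark yes comma it did period"
    ))
```

A substitution this naive cannot tell the command "period" from the ordinary word, as in "enrollment period", and it does not capitalize sentences, which hints at why the system has to be trained on punctuation commands and why the speaker needs practice.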

Disfluencies such as word fragments and filled pauses (e.g., "umm", "err") will generally cause recognition 
errors: the software will attempt to recognize them as legitimate dictionary words. Consequently, users of 
speech recognition systems need to be trained not to use filled pauses and to formulate their thoughts so 
that they can speak fluently once they start to speak. Silence is reliably recognized so it is far better to 
dictate in short, fluent bursts separated by large silences than to fill the silences with speech that shouldn't 
be transcribed. This is very different from the way in which we speak to other people, and some practice is 
required to learn this discipline. 

Using Speech Recognition Software For Transcription 

Fundamentally, there are two approaches to using speech recognition to produce transcriptions of an 
interview: real-time and batch. The real-time approach seeks to use speech recognition while the interview 
is in progress to produce a transcript. This approach would be similar to the use of a stenographer and does 
not require the creation of an audio recording. The batch approach relies on an audio recording of the 
interview and attempts to create a transcript after-the-fact. The term "batch" comes from the fact that 
several interview recordings can be processed together in a batch. 



Real-Time Transcription 

Real-time transcription would occur if a qualitative researcher could carry on a conversation with one or 
more interviewees and an accurate electronic transcription could be generated at that same time. This 
scenario would clearly have advantages for immediate member-checking and faster analysis of content. 
However, the present state-of-the-art is not capable of supporting this application. Current software is 
designed for recognizing dictations produced by a single speaker, not dialogues. Add to this the need for 
very quiet background environments, the intrusiveness of requiring the interviewee to wear a headset-mounted 
microphone, the recognition errors created by word fragments and filled pauses, and the need for 
each speaker to complete several hours of enrollment training to tune the recognition models, and the 
difficulties become far greater than can be justified. It is better to concentrate on simply obtaining a good audio 
recording. 



Batch Mode Transcription 

In this mode, the transcript is to be produced from a recording of the interview. This can be done in one of 
two ways: either the speech recognition software can be used to process the recording itself with a human 
transcriptionist then editing the resulting output, or the transcriptionist can use the speech recognition 
software directly and re-dictate the interview in a process known as "ghosting". Attempting to use the 
recognition software to directly process the audio recording is subject to all of the same limitations 
associated with real-time transcription. Nevertheless, doing so can accomplish some transcription that, 
despite a very high error rate, can show the range of the interview and may be of some value in helping to 
prepare a more accurate transcription. 

The most reliable method for generating a transcript is for the transcriptionist to simply re-dictate a 
recorded interview. The transcriptionist listens to the recorded interview and then repeats the sentences, 
including punctuation, into a high quality microphone. Because the only voice being recognized is that of 
the transcriptionist (who can go through the full enrollment and model training process), and the re- 
dictation can be done in a quiet environment, the recognition accuracy can be quite high. Typically, the 

Summary and Conclusions 

Because of the methodological requirements for interpretive validity, the need for member checking, and 
the desirability of audit trails, we believe that more consistent production of data transcripts is needed for 
qualitative studies. These transcripts represent primary data that could and should be shared and pooled 
amongst qualitative researchers, once issues of confidentiality are resolved. Our strong contention is that 
the adoption of VR techniques will allow workers with good subject knowledge but poor typing skills 
(graduate students or qualitative researchers) to operate at the high ends of productivity achieved by 
conventional transcriptionists. Economic savings can thus be realized and researchers would benefit from a 
more intimate experience of their data. 



Technical Challenges 

Clearly, the biggest technical problem associated with speech recognition software is providing for 
punctuation. The inability of the software programs to interpret verbal punctuation such as pauses and 
inflections (collectively called "prosody"), and the resulting requirement that all punctuation must be explicitly 
dictated, limit the utility of the software. Nor is software technology at the point where it can 
effectively use artificial intelligence heuristics to infer the correct punctuation of spoken conversation. This 
limitation, in combination with the requirements for high-quality recordings in quiet environments, 
effectively limits the use of speech recognition to re-dictation or ghosting. 



Cognitive Aspects And Productivity Gains 

Voice recognition software involves the cognitive aspects of switching modalities (hearing to speech vs. 
hearing to typing). The types of cognitive processes involved suggest the possible involvement or 
interaction with cognitive style (auditory or visual). The cognitive processes involved with voice 
recognition software indicate that reading abilities and memory capacities are involved, and we know that 
these abilities are found in different measure among users. One claim that needs to be validated is that 
examination of a printed transcript, with its opportunity for recursive reading, allows for more thorough and 
more insightful analysis of content. We also need to develop a systematic comparison of transcription 
speeds for professional typists versus speech recognition programs used by qualitative researchers. 

Enhanced Productivity for Qualitative Research 

Qualitative research is often derided for drawing conclusions from very limited sets of interviews. By 
recognizing interview transcripts as a primary data source, producing transcripts more consistently, and 
pooling them across groups of qualitative researchers, the authors hope for an enhanced public estimation 
of the value of qualitative research. By improving the productivity and reducing the cost associated with the 
production of transcripts, speech recognition should be a valuable addition to the qualitative researcher's 
arsenal. 